This project fits a machine learning algorithm to the Weight Lifting
Exercises (WLE) dataset to predict the type of exercise performed. The
data come from barbell lifts performed by 6 participants, with
measurements taken by devices such as Jawbone Up, Nike FuelBand, and
Fitbit. More information about the WLE dataset can be found here; see
the section on the weight lifting exercises dataset.
The training data contain more than nineteen thousand observations on
more than a hundred and fifty variables and can be downloaded from the
link given below.
Training
Dataset
This dataset is used to build the models for this project.
The testing data contain 20 observations on the same variables. They
are used to evaluate the out-of-sample error of the final machine
learning algorithm and can be downloaded from the link given below.
Testing
Dataset
We perform a bit of exploratory data analysis on the training data before modelling it. Let’s read it in first.
url_training <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_testing <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if(!file.exists('training.csv')){
  download.file(url_training, destfile = 'training.csv')
  download.file(url_testing, destfile = 'testing.csv')
}
training <- read.csv('training.csv')
testing <- read.csv('testing.csv')
Now that we have the data, it’s time to explore it. According to the information provided, the observations are taken by accelerometers placed at various locations, like the arm, barbell, etc. The first seven variables are identifiers and timestamps rather than sensor measurements, so they are not related to the outcome of interest (the classe variable) and are better removed.
# remove first 7 columns from training
train <- training[ , -c(1:7)]
# convert each measurement column to numeric; non-numeric entries become NA
for(i in 1:152){
  train[, i] <- as.numeric(train[, i])
}
Some of the variables consist almost entirely of NA values. They contribute little to a model, so we remove them.
# remove columns with fewer than 1000 non-NA values
train <- train[, (colSums(!is.na(train)) > 1000)]
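The cutoff of 1000 works because columns in this dataset are either essentially complete or almost entirely NA, so any threshold in between separates the two groups. A small base-R illustration of the idea, using toy data rather than the WLE columns:

```r
# toy data frame mimicking the pattern: one complete column,
# one that is almost entirely NA
df <- data.frame(complete = rnorm(100),
                 sparse   = c(rnorm(2), rep(NA, 98)))

colSums(!is.na(df))  # non-NA counts per column: 100 and 2

# keep only columns that are mostly observed
df_kept <- df[, colSums(!is.na(df)) > 50, drop = FALSE]
names(df_kept)       # only "complete" survives
```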
# look at the data
library(knitr)
knitr::kable(train[1:5, 1:5], caption = "First 5 Columns are now not empty")
| roll_belt | pitch_belt | yaw_belt | total_accel_belt | gyros_belt_x |
|---|---|---|---|---|
| 1.41 | 8.07 | -94.4 | 3 | 0.00 |
| 1.41 | 8.07 | -94.4 | 3 | 0.02 |
| 1.42 | 8.07 | -94.4 | 3 | 0.00 |
| 1.48 | 8.05 | -94.4 | 3 | 0.02 |
| 1.48 | 8.07 | -94.4 | 3 | 0.02 |
We have narrowed the data down to 45 predictor variables. To reduce dimensionality further while keeping most of the information, we use principal component analysis, which in caret captures 95 percent of the variance by default.
library(caret)
preProc <- preProcess(train[, -46], method = "pca")
# project the data onto the retained principal components
train_a <- predict(preProc, train)
train_a$classe <- as.factor(train$classe)
Now we use this processed data to create models.
Before fitting a model, let’s specify the resampling method and the required parameters.
library(caret)
# resampling method is Cross Validation with 10 folds
fitControl <- trainControl(method = "cv", number = 10, allowParallel = T) # allow parallel processing
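Note that allowParallel = TRUE only has an effect once a parallel backend is registered. The report does not show that step, so the sketch below is an assumption about how it is typically done, using the doParallel package:

```r
# hypothetical parallel backend setup (not shown in the original report);
# requires the CRAN package doParallel
library(parallel)
library(doParallel)

cluster <- makeCluster(detectCores() - 1)  # leave one core for the OS
registerDoParallel(cluster)

# ... fit models with caret::train(..., trControl = fitControl) ...

# tear down after training:
# stopCluster(cluster)
# registerDoSEQ()
```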
Split the data into training and testing sets.
library(caret)
set.seed(420)
inTrain <- createDataPartition(train_a$classe, p = .7, list = F)
train <- train_a[inTrain, ]
test <- train_a[-inTrain, ]
Fit a random forest to the data.
library(caret)
x <- train[, -1]  # predictors from the training split
y <- train[, 1]   # outcome (classe)
# fit the random forest
set.seed(920)
rfFit <- train(x, y, method = "rf", trControl = fitControl)
Now we have a random forest fit, trained in a reasonably short time thanks to parallel processing. We can check its accuracy on the held-out cross-validation folds.
rfFit$resample
## Accuracy Kappa Resample
## 1 0.9737991 0.9668696 Fold02
## 2 0.9687273 0.9604192 Fold01
## 3 0.9716157 0.9640805 Fold03
## 4 0.9752547 0.9686974 Fold06
## 5 0.9687045 0.9604043 Fold05
## 6 0.9708242 0.9630670 Fold04
## 7 0.9774217 0.9714331 Fold07
## 8 0.9716157 0.9640709 Fold10
## 9 0.9723837 0.9650739 Fold09
## 10 0.9642857 0.9547931 Fold08
confusionMatrix.train(rfFit)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 28.2 0.5 0.0 0.0 0.0
## B 0.1 18.5 0.3 0.0 0.1
## C 0.1 0.3 17.0 0.8 0.2
## D 0.1 0.0 0.1 15.5 0.1
## E 0.0 0.0 0.0 0.0 17.9
##
## Accuracy (average) : 0.9715
Also on the held out test set.
confusionMatrix(predict(rfFit, test), test$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1197 0 0 0 0
## B 0 752 0 0 0
## C 0 0 739 0 0
## D 0 0 0 659 0
## E 0 0 0 0 782
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9991, 1)
## No Information Rate : 0.2899
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.000 1.0000 1.0000
## Prevalence 0.2899 0.1821 0.179 0.1596 0.1894
## Detection Rate 0.2899 0.1821 0.179 0.1596 0.1894
## Detection Prevalence 0.2899 0.1821 0.179 0.1596 0.1894
## Balanced Accuracy 1.0000 1.0000 1.000 1.0000 1.0000
Next, we fit a decision tree model to the data.
library(caret)
# fit a decision tree model
set.seed(920)
dtFit <- train(x, y, method = "rpart", trControl = fitControl)
# check it out
library(rattle)
fancyRpartPlot(dtFit$finalModel, palettes = c("Greys", "Oranges"), caption = "columns are named 'PC<number>' because of PCA")
Let’s see how this method performed. First, on the held out sets of resamples.
confusionMatrix.train(dtFit)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 21.4 7.6 10.7 4.8 5.8
## B 3.3 7.3 2.0 7.5 5.2
## C 2.7 1.9 3.8 1.2 1.5
## D 0.8 2.0 0.6 1.8 1.4
## E 0.3 0.5 0.4 1.1 4.5
##
## Accuracy (average) : 0.3876
And then on the held out test set.
confusionMatrix(predict(dtFit, test), test$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1030 393 631 259 313
## B 162 345 92 349 287
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 5 14 16 51 182
##
## Overall Statistics
##
## Accuracy : 0.3771
## 95% CI : (0.3623, 0.3921)
## No Information Rate : 0.2899
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.1682
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8605 0.45878 0.000 0.0000 0.23274
## Specificity 0.4557 0.73645 1.000 1.0000 0.97431
## Pos Pred Value 0.3922 0.27935 NaN NaN 0.67910
## Neg Pred Value 0.8889 0.85936 0.821 0.8404 0.84460
## Prevalence 0.2899 0.18213 0.179 0.1596 0.18939
## Detection Rate 0.2495 0.08356 0.000 0.0000 0.04408
## Detection Prevalence 0.6360 0.29910 0.000 0.0000 0.06491
## Balanced Accuracy 0.6581 0.59761 0.500 0.5000 0.60352
Next, we fit a GBM model to the data using parallel processing.
# fit a GBM model
set.seed(920)
gbmFit <- train(x, y, method = "gbm", verbose = F, trControl = fitControl)
Let’s see how it performs. First, on the held out test sets in the folds.
confusionMatrix.train(gbmFit)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 25.6 2.3 1.0 0.6 0.5
## B 0.6 14.5 1.1 0.2 1.5
## C 0.9 1.5 14.4 1.9 1.3
## D 1.2 0.5 0.7 13.3 0.8
## E 0.2 0.5 0.2 0.4 14.3
##
## Accuracy (average) : 0.8216
Second, on the held out test set.
confusionMatrix(predict(gbmFit, test), test$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1110 89 26 11 15
## B 20 577 39 3 43
## C 25 63 635 78 45
## D 37 11 33 559 38
## E 5 12 6 8 641
##
## Overall Statistics
##
## Accuracy : 0.853
## 95% CI : (0.8418, 0.8637)
## No Information Rate : 0.2899
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8135
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9273 0.7673 0.8593 0.8483 0.8197
## Specificity 0.9519 0.9689 0.9378 0.9657 0.9907
## Pos Pred Value 0.8873 0.8460 0.7506 0.8245 0.9539
## Neg Pred Value 0.9698 0.9492 0.9683 0.9710 0.9592
## Prevalence 0.2899 0.1821 0.1790 0.1596 0.1894
## Detection Rate 0.2688 0.1397 0.1538 0.1354 0.1552
## Detection Prevalence 0.3030 0.1652 0.2049 0.1642 0.1628
## Balanced Accuracy 0.9396 0.8681 0.8985 0.9070 0.9052
Next, we fit an SVM model to the data, again using parallel processing.
# fit the svm model
set.seed(920)
svmFit <- train(x, y, method = "svmRadial", trControl = fitControl)
Let’s see how the fit performs. First, on the held out test sets in folds.
confusionMatrix.train(svmFit)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 28.0 1.8 0.2 0.2 0.0
## B 0.2 16.1 0.7 0.1 0.3
## C 0.2 1.3 15.7 1.7 1.0
## D 0.1 0.2 0.7 14.4 0.7
## E 0.0 0.1 0.1 0.0 16.3
##
## Accuracy (average) : 0.9049
Second, on the held out test set. The all-zero confusion matrix below indicates that the prediction step appears to have returned no usable predictions here, so the SVM’s test-set performance could not be assessed.
confusionMatrix(predict(svmFit, test), test$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 0 0 0 0 0
## B 0 0 0 0 0
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity NA NA NA NA NA
## Specificity NA NA NA NA NA
## Pos Pred Value NA NA NA NA NA
## Neg Pred Value NA NA NA NA NA
## Prevalence NaN NaN NaN NaN NaN
## Detection Rate NaN NaN NaN NaN NaN
## Detection Prevalence NaN NaN NaN NaN NaN
## Balanced Accuracy NA NA NA NA NA
Lastly, let us fit a regularized discriminant analysis (RDA) model to the data.
# fit the RDA model
set.seed(920)
rdaFit <- train(x, y, method = "rda", trControl = fitControl)
Let’s see how the model performs. First, on the held out test sets in the folds.
confusionMatrix.train(rdaFit)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 20.7 0.9 0.2 0.3 0.1
## B 2.3 15.1 2.0 1.0 1.9
## C 4.1 2.7 14.9 3.8 1.0
## D 1.1 0.2 0.3 10.9 1.9
## E 0.2 0.4 0.1 0.4 13.5
##
## Accuracy (average) : 0.7513
Second, on the held out test set.
confusionMatrix(predict(rdaFit, test), test$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 884 34 4 11 2
## B 99 583 74 34 73
## C 161 108 644 166 41
## D 47 10 12 437 97
## E 6 17 5 11 569
##
## Overall Statistics
##
## Accuracy : 0.7549
## 95% CI : (0.7415, 0.768)
## No Information Rate : 0.2899
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6923
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7385 0.7753 0.8714 0.6631 0.7276
## Specificity 0.9826 0.9171 0.8596 0.9522 0.9883
## Pos Pred Value 0.9455 0.6756 0.5750 0.7247 0.9359
## Neg Pred Value 0.9020 0.9483 0.9684 0.9370 0.9395
## Prevalence 0.2899 0.1821 0.1790 0.1596 0.1894
## Detection Rate 0.2141 0.1412 0.1560 0.1058 0.1378
## Detection Prevalence 0.2264 0.2090 0.2713 0.1460 0.1473
## Balanced Accuracy 0.8606 0.8462 0.8655 0.8076 0.8580
library(jpeg)
library(ggplot2)
library(patchwork)
df <- rbind(rfFit$resample, dtFit$resample, gbmFit$resample, svmFit$resample, rdaFit$resample)
df$Model <- c(rep("Random Forest", 10), rep("Decision Trees", 10), rep("Gradient Boosting Machine", 10), rep("Support Vector Machine", 10), rep("Regularized Discriminant Analysis", 10))
# read in background
img <- readJPEG('forest.jpg', native = T)
# plot...
g <- ggplot(df, aes(Resample, Accuracy))
g <- g + geom_point(aes(color = Model), size = 4)
g <- g + theme(legend.background = element_rect(fill = alpha('brown', .5)),
               legend.text = element_text(face = 'italic', family = 'serif', size = 14, colour = 'white'),
               legend.title = element_text(colour = 'white', size = 16),
               legend.key = element_rect(fill = alpha('white', .1), color = 'white'))
g <- g + theme(panel.background = element_rect(fill = alpha('white', .1), colour = alpha('white', .1)))
g <- g + theme(axis.text = element_text(colour = 'white', size = 12),
               axis.ticks = element_line(colour = alpha('green', .1)),
               axis.text.x = element_text(angle = 90),
               axis.title = element_text(face = 'bold.italic', family = 'serif', size = 18, colour = 'white'))
g <- g + theme(panel.grid.major = element_line(color = alpha('green', .1)))
g <- g + inset_element(img, left = -.1, right = 1.1, top = 1.1, bottom = -.1, align_to = "full", on_top = F)
g
Random forests perform excellently, far better than the other models; the SVM also does well on the resamples.
The accuracies on the held-out test set have already been computed using
the confusionMatrix() function. The best performing model is the random
forest, with an accuracy of 1, i.e. an error rate of zero percent.
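The per-fold accuracies plotted above can also be summarised numerically. A base-R sketch, using toy fold accuracies in place of the real resample data frames:

```r
# toy stand-in for the combined resample data frame built for the plot
df <- data.frame(
  Accuracy = c(0.97, 0.98, 0.38, 0.39, 0.82, 0.83),
  Model    = rep(c("Random Forest", "Decision Trees",
                   "Gradient Boosting Machine"), each = 2)
)

# mean accuracy per model, best first
sort(tapply(df$Accuracy, df$Model, mean), decreasing = TRUE)
```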
Since the random forest performed so well, we do not need to build an
ensemble of models; the random forest alone is enough. For the final
part, we use it to predict the validation set of 20 samples.
# preprocess the testing data the same way as the training data
testing <- testing[, -c(1:7)]
for(i in 1:152){
  testing[, i] <- as.numeric(testing[, i])
}
# keep columns with at least one non-NA value
testing <- testing[, (colSums(!is.na(testing)) > 0)]
preProc2 <- preProcess(testing, method = "pca")
testing <- predict(preProc2, testing)
# predict
predict(rfFit, testing)
## Error in predict.randomForest(modelFit, newdata): variables in the training data missing in newdata
This gives an error because principal component analysis produced 12
components on the validation set, whereas there were 25 components when
the model was fit on the train dataset.
So we refit the random forest on the train set, but with 12
components this time.
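As an aside, this mismatch can also be avoided without refitting: apply the preProcess object fitted on the training data to the validation set, so both are projected with the same 25-component rotation. A sketch, assuming the cleaned validation data has been reduced to the same predictor columns the training PCA was fitted on (preProc and rfFit are the objects created earlier):

```r
# hypothetical alternative: reuse the training-set PCA rotation
# (assumes `testing` holds the same predictor columns used to fit `preProc`)
testing_pca <- predict(preProc, testing)
predict(rfFit, testing_pca)
```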
# rebuild the cleaned pre-PCA training data under a new name
train_b <- training[, -c(1:7)]
for(i in 1:152){
  train_b[, i] <- as.numeric(train_b[, i])
}
train_b <- train_b[, (colSums(!is.na(train_b)) > 1000)]
# PCA with exactly 12 components
preProc_modified <- preProcess(train_b, method = "pca", pcaComp = 12)
train_b <- predict(preProc_modified, train_b)
train_b$classe <- as.factor(train_b$classe)
train_b <- train_b[inTrain, ]
test_b <- train_b[-inTrain, ]
x <- train_b[ , -1]
y <- train_b[ , 1]
# RF fit
set.seed(920)
rfFit_modified <- train(x, y, method = "rf", trControl = fitControl)
# Does it still perform well?
confusionMatrix.train(rfFit_modified)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 27.7 0.6 0.2 0.2 0.1
## B 0.2 18.1 0.3 0.1 0.2
## C 0.3 0.4 16.6 0.9 0.2
## D 0.2 0.2 0.3 15.2 0.2
## E 0.0 0.0 0.1 0.0 17.8
##
## Accuracy (average) : 0.9548
confusionMatrix(predict(rfFit_modified, test_b), test_b$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1197 0 0 0 0
## B 0 752 0 0 0
## C 0 0 739 0 0
## D 0 0 0 659 0
## E 0 0 0 0 782
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9991, 1)
## No Information Rate : 0.2899
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.000 1.0000 1.0000
## Prevalence 0.2899 0.1821 0.179 0.1596 0.1894
## Detection Rate 0.2899 0.1821 0.179 0.1596 0.1894
## Detection Prevalence 0.2899 0.1821 0.179 0.1596 0.1894
## Balanced Accuracy 1.0000 1.0000 1.000 1.0000 1.0000
Yes, it still performs well.
# Now, predict the validation set
testing$classe <- NA  # placeholder outcome column
predict(rfFit_modified, testing)
## [1] A A B E A B E E E C A D E E B E E E B B
## Levels: A B C D E